Introduction to Data Visualization with Seaborn

Lecturers: Yashas Roy, Mona Khalil


It is recommended that you take the course Introduction to Data Science in Python prior to this course.

1 Course Description

Seaborn is a powerful Python library that makes it easy to create informative and attractive data visualizations. This 4-hour course provides an introduction to how you can use Seaborn to create a variety of plots, including scatter plots, count plots, bar plots, and box plots, and how you can customize your visualizations.

You’ll explore this library and create Seaborn plots based on a variety of real-world data sets, including exploring how air pollution in a city changes through the day and looking at what young people like to do in their free time. This data will give you the opportunity to find out about Seaborn’s advantages first hand, including how you can easily create subplots in a single figure and how to automatically calculate confidence intervals.

By the end of this course, you’ll be able to use Seaborn in various situations to explore your data and effectively communicate the results of your data analysis to others. These skills are highly sought-after for data analysts, data scientists, and any other job that may involve creating data visualizations.

Course materials can be found here.

2 Introduction to Seaborn

2.1 Lecture: Introduction to Seaborn

2.2 Making a scatter plot with lists

In this exercise, we’ll use a dataset that contains information about 227 countries. This dataset has lots of interesting information on each country, such as the country’s birth rates, death rates, and its gross domestic product (GDP). Here, GDP is reported per person: the value of all the goods and services a country produces in a year, divided by its population and expressed in dollars.

We’ve created three lists of data from this dataset to get you started. gdp is a list that contains the value of GDP per country, expressed as dollars per person. phones is a list of the number of mobile phones per 1,000 people in that country. Finally, percent_literate is a list that contains the percent of each country’s population that can read and write.

import pandas as pd
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/gdp.csv'
gdp = pd.read_csv(url)['gdp']
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/phones.csv'
phones = pd.read_csv(url)['phones']
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/percent_literate.csv'
percent_literate = pd.read_csv(url)['percent_literate']

# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot with GDP on the x-axis and number of phones on the y-axis
sns.scatterplot(x = gdp, y = phones)

# Show plot
plt.show()

# Change this scatter plot to have percent literate on the y-axis
sns.scatterplot(x = gdp, y = percent_literate)

# Show plot
plt.show()

While this plot does not show a linear relationship between GDP and percent literate, countries with a lower GDP do seem more likely to have a lower percent of the population that can read and write.

2.3 Making a count plot with a list

In the last exercise, we explored a dataset that contains information about 227 countries. Let’s do more exploration of this data - specifically, how many countries are in each region of the world?

To do this, we’ll need to use a count plot. Count plots take in a categorical list and return bars that represent the number of list entries per category. You can create one here using a list of regions for each country, which is a variable named region.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/region.csv'
region = pd.read_csv(url)['region']

# Create count plot with region on the y-axis
sns.countplot(y = region)

# Show plot
plt.show()

Sub-Saharan Africa contains the most countries in this list. We’ll revisit count plots later in the course.
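
To make the counting step concrete, here is a minimal sketch of the tally a count plot represents with its bars. The short `region_sample` list is made up for illustration and is not the course's list of 227 countries.

```python
from collections import Counter

# Hypothetical sample of region labels (not the real 227-country list)
region_sample = [
    "SUB-SAHARAN AFRICA", "WESTERN EUROPE", "SUB-SAHARAN AFRICA",
    "OCEANIA", "WESTERN EUROPE", "SUB-SAHARAN AFRICA",
]

# A count plot draws one bar per category; the bar length is this tally
region_counts = Counter(region_sample)

for name, count in region_counts.most_common():
    print(f"{name}: {count}")
```

Reading the tally first is a quick way to sanity-check the bars you expect to see.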

2.4 Lecture: Using pandas with Seaborn

2.5 “Tidy” vs. “untidy” data

Here, we have a sample dataset from a survey of children about their favorite animals. But can we use this dataset as-is with Seaborn? Let’s use pandas to import the csv file with the data collected from the survey and determine whether it is tidy, which is essential to having it work well with Seaborn.

To get you started, the filepath to the csv file has been assigned to the variable csv_filepath.

Note that because csv_filepath is a Python variable, you will not need to put quotation marks around it when you read the csv.

csv_filepath = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/1.2.1_example_csv.csv'

# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)

# Print the head of df
print(df.head())
##   Unnamed: 0               How old are you?
## 0     Marion                             12
## 1      Elroy                             16
## 2        NaN  What is your favorite animal?
## 3     Marion                            dog
## 4      Elroy                            cat

The DataFrame is untidy, because a single column contains different types of information. Always make sure to check if your DataFrame is tidy before using it with Seaborn.
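
As a rough sketch of what "tidying" could look like here, the snippet below rebuilds the five printed rows by hand and reshapes them so that each row is one child and each column is one variable. The column names `name`, `age`, and `favorite_animal` are our own choice, and the row positions are hard-coded for this tiny example only.

```python
import pandas as pd

# Recreate the untidy rows shown in the printed output above
untidy = pd.DataFrame({
    "Unnamed: 0": ["Marion", "Elroy", None, "Marion", "Elroy"],
    "How old are you?": ["12", "16", "What is your favorite animal?", "dog", "cat"],
})

# Rows 0-1 hold the ages; rows 3-4 hold the favorite animals
ages = untidy.iloc[0:2].set_index("Unnamed: 0")["How old are you?"].astype(int)
animals = untidy.iloc[3:5].set_index("Unnamed: 0")["How old are you?"]

# One row per child, one column per variable
tidy = pd.DataFrame({"age": ages, "favorite_animal": animals})
tidy.index.name = "name"
tidy = tidy.reset_index()
print(tidy)
```

In the tidy version, each column holds a single type of information, which is the shape Seaborn's `data=` argument works best with.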

2.6 Making a count plot with a DataFrame

In this exercise, we’ll look at the responses to a survey sent out to young people. Our primary question here is: how many young people surveyed report being scared of spiders? Survey participants were asked to agree or disagree with the statement “I am afraid of spiders”. Responses vary from 1 to 5, where 1 is “Strongly disagree” and 5 is “Strongly agree”.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/young-people-survey-responses.csv'
df = pd.read_csv(url)

# Create a count plot with "Spiders" on the x-axis
sns.countplot(x = "Spiders", data = df)

# Display the plot
plt.show()

This plot shows us that most young people reported not being afraid of spiders.

2.7 Lecture: Adding a third variable with hue

2.8 Hue and scatter plots

In the prior video, we learned how hue allows us to easily make subgroups within Seaborn plots. Let’s try it out by exploring data from students in secondary school. We have a lot of information about each student like their age, where they live, their study habits and their extracurricular activities.

For now, we’ll look at the relationship between the number of absences they have in school and their final grade in the course, segmented by where the student lives (rural vs. urban area).

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/student_data.csv'
student_data = pd.read_csv(url)

# Create a scatter plot of absences vs. final grade
sns.scatterplot(data = student_data, x = "absences", y = "G3", hue = "location")

# Show plot
plt.show()

# Change the legend order in the scatter plot
sns.scatterplot(x = "absences", y = "G3", 
                data = student_data, 
                hue = "location", hue_order = ['Rural', 'Urban'])

# Show plot
plt.show()

It looks like students with more absences tend to have lower grades in both rural and urban areas.

2.9 Hue and count plots

Let’s continue exploring our dataset from students in secondary school by looking at a new variable. The “school” column indicates the initials of which school the student attended - either “GP” or “MS”.

In the last exercise, we created a scatter plot where the plot points were colored based on whether the student lived in an urban or rural area. How many students live in urban vs. rural areas, and does this vary based on what school the student attends? Let’s make a count plot with subgroups to find out.

# Create a dictionary mapping subgroup values to colors
palette_colors = {"Rural": "green", "Urban": "blue"}

# Create a count plot of school with location subgroups
sns.countplot(data = student_data, x = "school", hue = "location", palette = palette_colors)

# Display plot
plt.show()

Students at GP tend to come from an urban location, but students at MS are more evenly split.
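
The numbers behind a count plot with hue are a two-way table of counts. The sketch below builds that table with `pd.crosstab()` on a made-up six-row sample that only borrows the column names from `student_data`.

```python
import pandas as pd

# Hypothetical mini-sample; the real student_data has many more rows
sample = pd.DataFrame({
    "school":   ["GP", "GP", "GP", "GP", "MS", "MS"],
    "location": ["Urban", "Urban", "Urban", "Rural", "Urban", "Rural"],
})

# Rows are the x-axis categories, columns are the hue subgroups
counts = pd.crosstab(sample["school"], sample["location"])
print(counts)
```

Each cell of the table corresponds to one bar in the grouped count plot.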

3 Visualizing Two Quantitative Variables

3.1 Lecture: Introduction to relational plots and subplots

3.2 Creating subplots with col and row

We’ve seen in prior exercises that students with more absences (“absences”) tend to have lower final grades (“G3”). Does this relationship hold regardless of how much time students study each week?

To answer this, we’ll look at the relationship between the number of absences that a student has in school and their final grade in the course, creating separate subplots based on each student’s weekly study time (“study_time”).

# Change to use relplot() instead of scatterplot()
g = sns.relplot(x = "absences", y = "G3", 
                data = student_data, kind = "scatter")

# Show plot
plt.show()

# Change to make subplots based on study time
g = sns.relplot(x = "absences", y = "G3",
                data = student_data,
                kind = "scatter",col = "study_time")

# Show plot
plt.show()

# Change to arrange the plots in rows instead of columns
g = sns.relplot(x = "absences", y = "G3", 
                data = student_data,
                kind = "scatter", row = "study_time")

# Show plot
plt.show()

Because these subplots had a large range of x values, it’s easier to read them arranged in rows instead of columns.

3.3 Creating two-factor subplots

Let’s continue looking at the student_data dataset of students in secondary school. Here, we want to answer the following question: does a student’s first semester grade (“G1”) tend to correlate with their final grade (“G3”)?

There are many aspects of a student’s life that could result in a higher or lower final grade in the class. For example, some students receive extra educational support from their school (“schoolsup”) or from their family (“famsup”), which could result in higher grades. Let’s try to control for these two factors by creating subplots based on whether the student received extra educational support from their school or family.

# Create a scatter plot of G1 vs. G3
g = sns.relplot(data = student_data, x = "G1", y = "G3", kind = "scatter")

# Show plot
plt.show()

# Adjust to add subplots based on school support
g = sns.relplot(x = "G1", y = "G3", 
            data = student_data,
            kind = "scatter", col = "schoolsup", col_order = ["yes", "no"])

# Show plot
plt.show()

# Adjust further to add subplots based on family support
g = sns.relplot(x = "G1", y = "G3", 
                data = student_data,
                kind = "scatter", 
                col = "schoolsup", col_order = ["yes", "no"],
                row = "famsup", row_order = ["yes", "no"])

# Show plot
plt.show()

It looks like the first semester grade does correlate with the final grade, regardless of what kind of support the student received.
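
To see how col and row combine into a grid, this sketch runs the same kind of call on a small made-up DataFrame and inspects the resulting FacetGrid. The `toy` data and the "Agg" backend (used only so the snippet runs without a display) are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for student_data with the same column names
toy = pd.DataFrame({
    "G1": [8, 12, 15, 10, 14, 9, 11, 16],
    "G3": [9, 12, 16, 10, 15, 8, 11, 17],
    "schoolsup": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
    "famsup":    ["yes", "yes", "no", "no", "yes", "yes", "no", "no"],
})

g = sns.relplot(data=toy, x="G1", y="G3", kind="scatter",
                col="schoolsup", col_order=["yes", "no"],
                row="famsup", row_order=["yes", "no"])

# g.axes is a 2-D array of Matplotlib axes: one row per famsup value,
# one column per schoolsup value
print(g.axes.shape)
```

Two values for each factor give a 2 x 2 grid, so every subplot shows one combination of school and family support.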

3.4 Lecture: Customizing scatter plots

3.5 Changing the size of scatter plot points

In this exercise, we’ll explore Seaborn’s mpg dataset, which contains one row per car model and includes information such as the year the car was made, the number of miles per gallon (“M.P.G.”) it achieves, the power of its engine (measured in “horsepower”), and its country of origin.

What is the relationship between the power of a car’s engine (“horsepower”) and its fuel efficiency (“mpg”)? And how does this relationship vary by the number of cylinders (“cylinders”) the car has? Let’s find out.

Let’s continue to use relplot() instead of scatterplot() since it offers more flexibility.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/mpg.csv'
mpg = pd.read_csv(url)

# Create scatter plot of horsepower vs. mpg
g = sns.relplot(x = "horsepower", y = "mpg",
                data = mpg, kind = "scatter", 
                size = "cylinders", hue = "cylinders")

# Show plot
plt.show()

Cars with higher horsepower tend to get a lower number of miles per gallon. They also tend to have a higher number of cylinders.

3.6 Changing the style of scatter plot points

Let’s continue exploring Seaborn’s mpg dataset by looking at the relationship between how fast a car can accelerate (“acceleration”) and its fuel efficiency (“mpg”). Do these properties vary by country of origin (“origin”)?

Note that the “acceleration” variable is the time to accelerate from 0 to 60 miles per hour, in seconds. Higher values indicate slower acceleration.

# Create a scatter plot of acceleration vs. mpg
g = sns.relplot(data = mpg, x = "acceleration", y = "mpg",
                kind = "scatter", hue = "origin", style = "origin")

# Show plot
plt.show()

Cars from the USA tend to accelerate more quickly and get lower miles per gallon compared to cars from Europe and Japan.

3.7 Lecture: Introduction to line plots

3.8 Interpreting line plots

In this exercise, we’ll continue to explore Seaborn’s mpg dataset, which contains one row per car model and includes information such as the year the car was made, its fuel efficiency (measured in “miles per gallon” or “M.P.G”), and its country of origin (USA, Europe, or Japan).

How has the average miles per gallon achieved by these cars changed over time? Let’s use line plots to find out!

# Create line plot
g = sns.relplot(data = mpg, x = "model_year", y = "mpg", kind = "line")

# Show plot
plt.show()

Note that the shaded region represents a confidence interval for the mean, not the distribution of the observations.

3.9 Visualizing standard deviation with line plots

In the last exercise, we looked at how the average miles per gallon achieved by cars has changed over time. Now let’s use a line plot to visualize how the distribution of miles per gallon has changed over time.

# Make the shaded area show the standard deviation
g = sns.relplot(x = "model_year", y = "mpg",
                data = mpg, kind = "line", ci = 'sd')

# Show plot
plt.show()

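Unlike the plot in the last exercise, the shaded region in this plot shows the spread of miles per gallon among the cars in each year (one standard deviation on either side of the mean), rather than the uncertainty about the mean itself.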
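
The difference between the two shaded regions can be sketched by hand. Below, a percentile bootstrap stands in for the confidence interval of the mean, and the sample standard deviation stands in for the "sd" band. The `toy` values, the helper `bootstrap_ci`, and the choice of 2000 resamples are all assumptions, so the numbers will not match Seaborn's output exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical mpg values for two model years (not the real dataset)
toy = pd.DataFrame({
    "model_year": [70] * 8 + [71] * 8,
    "mpg": [14, 15, 16, 17, 18, 19, 20, 21,
            18, 20, 21, 22, 23, 24, 25, 27],
})

rng = np.random.default_rng(0)

def bootstrap_ci(values, n_boot=2000, level=95):
    """Percentile bootstrap interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    resamples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    boot_means = resamples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

summary = {}
for year, group in toy.groupby("model_year"):
    mean = group["mpg"].mean()
    sd = group["mpg"].std()  # sample standard deviation
    low, high = bootstrap_ci(group["mpg"])
    summary[year] = (mean, sd, low, high)
    print(f"{year}: mean={mean:.1f}  CI=({low:.1f}, {high:.1f})  "
          f"mean±sd=({mean - sd:.1f}, {mean + sd:.1f})")
```

The interval for the mean is narrower than the standard deviation band because it describes how precisely the average is known, not how much individual cars vary.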

3.10 Plotting subgroups in line plots

Let’s continue to look at the mpg dataset. We’ve seen that the average miles per gallon for cars has increased over time, but how has the average horsepower for cars changed over time? And does this trend differ by country of origin?

# Create line plot of model year vs. horsepower
g = sns.relplot(data = mpg, x = "model_year", y = "horsepower",
                kind = "line", ci = None)

# Show plot
plt.show()

# Change to create subgroups for country of origin
g = sns.relplot(x = "model_year", y = "horsepower", 
                data = mpg, kind = "line", 
                ci = None, style = "origin", hue = "origin")

# Show plot
plt.show()

# Add markers and make each line have the same style
g = sns.relplot(x = "model_year", y = "horsepower", 
                data = mpg, kind = "line",ci = None, style = "origin", 
                hue = "origin", markers = True, dashes = False)

# Show plot
plt.show()

Now that we’ve added subgroups, we can see that this downward trend in horsepower was more pronounced among cars from the USA.

4 Visualizing a Categorical and a Quantitative Variable

4.1 Lecture: Count plots and bar plots

4.2 Count plots

In this exercise, we’ll return to exploring our dataset that contains the responses to a survey sent out to young people. We might suspect that young people spend a lot of time on the internet, but how much do they report using the internet each day? Let’s use a count plot to break down the number of survey responses in each category and then explore whether it changes based on age.

As a reminder, to create a count plot, we’ll use the catplot() function and specify the name of the categorical variable to count (x=____), the pandas DataFrame to use (data=____), and the type of plot (kind=“count”).

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/survey_data_age.csv'
survey_data = pd.read_csv(url, sep = ';')

# Create count plot of internet usage
g = sns.catplot(data = survey_data, x = "Internet usage", kind = "count")

# Show plot
plt.show()

# Change the orientation of the plot
g = sns.catplot(y = "Internet usage", data = survey_data, kind = "count")

# Show plot
plt.show()

# Separate into column subplots based on age category
g = sns.catplot(y = "Internet usage", data = survey_data,
                kind = "count", col = "Age Category")

# Show plot
plt.show()

It looks like most young people use the internet for a few hours every day, regardless of their age.

4.3 Bar plots with percentages

Let’s continue exploring the responses to a survey sent out to young people. The variable “Interested in Math” is True if the person reported being interested or very interested in mathematics, and False otherwise. What percentage of young people report being interested in math, and does this vary based on gender? Let’s use a bar plot to find out.

As a reminder, we’ll create a bar plot using the catplot() function, providing the name of categorical variable to put on the x-axis (x=____), the name of the quantitative variable to summarize on the y-axis (y=____), the pandas DataFrame to use (data=____), and the type of categorical plot (kind=“bar”).

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/survey_data_math.csv'
survey_data = pd.read_csv(url)

# Create a bar plot of interest in math, separated by gender
g = sns.catplot(data = survey_data,
                x = "Gender", y = "Interested in Math", kind = 'bar')

# Show plot
plt.show()

When the y-variable is True/False, bar plots show the proportion of responses reporting True (so a bar height of 0.5 corresponds to 50%). This plot shows us that males report a much higher interest in math compared to females.
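
The reason a True/False column works on the y-axis is that its mean is the share of True values. A minimal sketch with made-up responses (not the survey data):

```python
import pandas as pd

# Hypothetical responses (not the real survey data)
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "male",
               "female", "female", "female", "female"],
    "Interested in Math": [True, True, True, False,
                           True, False, False, False],
})

# True counts as 1 and False as 0, so the mean is the share of True answers
share_true = toy.groupby("Gender")["Interested in Math"].mean()
print(share_true)
```

These per-group means are the quantities the bar heights represent.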

4.4 Customizing bar plots

In this exercise, we’ll explore data from students in secondary school. The “study_time” variable records each student’s reported weekly study time as one of the following categories: “<2 hours”, “2 to 5 hours”, “5 to 10 hours”, or “>10 hours”. Do students who report higher amounts of studying tend to get better final grades? Let’s compare the average final grade among students in each category using a bar plot.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/student_data.csv'
student_data = pd.read_csv(url)

# Create bar plot of average final grade in each study category
g = sns.catplot(data = student_data, x = "study_time", y = "G3", kind = 'bar')

# Show plot
plt.show()

# List of categories from lowest to highest
category_order = ["<2 hours", 
                  "2 to 5 hours", 
                  "5 to 10 hours", 
                  ">10 hours"]

# Rearrange the categories
g = sns.catplot(x = "study_time", y = "G3", data = student_data,
                kind = "bar", order = category_order)

# Show plot
plt.show()

# Hide confidence intervals
g = sns.catplot(x = "study_time", y = "G3", data = student_data,
                kind = "bar", order = category_order, ci = None)

# Show plot
plt.show()

Students in our sample who studied more have a slightly higher average grade, but it’s not a strong relationship.
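
Category labels such as "<2 hours" have no built-in logical order, which is why an explicit list is needed. The sketch below shows the same issue in plain pandas on made-up grades: `groupby()` sorts the labels as text, and `reindex()` with `category_order` restores the intended sequence.

```python
import pandas as pd

# Hypothetical grades using the same study_time labels
toy = pd.DataFrame({
    "study_time": ["<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours",
                   "<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours"],
    "G3": [9, 10, 11, 12, 11, 10, 13, 12],
})
category_order = ["<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours"]

# groupby() sorts the labels as strings, not from lowest to highest
means = toy.groupby("study_time")["G3"].mean()
print(list(means.index))

# Reindexing with the explicit list restores the logical order
ordered_means = means.reindex(category_order)
print(ordered_means)
```

Passing the same list to `order=` in `catplot()` plays the equivalent role for the bars.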

4.5 Lecture: Box plots

4.6 Create and interpret a box plot

Let’s continue using the student_data dataset. In an earlier exercise, we explored the relationship between studying and final grade by using a bar plot to compare the average final grade (“G3”) among students in different categories of “study_time”.

In this exercise, we’ll try using a box plot to look at this relationship instead. As a reminder, to create a box plot you’ll need to use the catplot() function and specify the name of the categorical variable to put on the x-axis (x=____), the name of the quantitative variable to summarize on the y-axis (y=____), the pandas DataFrame to use (data=____), and the type of plot (kind=“box”).

# Specify the category ordering
study_time_order = ["<2 hours", "2 to 5 hours", 
                    "5 to 10 hours", ">10 hours"]

# Create a box plot and set the order of the categories
g = sns.catplot(data = student_data, x = "study_time", y = "G3",
                kind = 'box', order = study_time_order)

# Show plot
plt.show()

4.7 Omitting outliers

Now let’s use the student_data dataset to compare the distribution of final grades (“G3”) between students who have internet access at home and those who don’t. To do this, we’ll use the “internet” variable, which is a binary (yes/no) indicator of whether the student has internet access at home.

Since internet may be less accessible in rural areas, we’ll add subgroups based on where the student lives. For this, we can use the “location” variable, which is an indicator of whether a student lives in an urban (“Urban”) or rural (“Rural”) location.

# Create a box plot with subgroups and omit the outliers
g = sns.catplot(data = student_data, x = "internet", y = "G3",
                kind = 'box', hue = "location", showfliers = False)

# Show plot
plt.show()

The median grades are quite similar between each group, but the spread of the distribution looks larger among students who have internet access.

4.8 Adjusting the whiskers

In the lesson we saw that there are multiple ways to define the whiskers in a box plot. In this set of exercises, we’ll continue to use the student_data dataset to compare the distribution of final grades (“G3”) between students who are in a romantic relationship and those that are not. We’ll use the “romantic” variable, which is a yes/no indicator of whether the student is in a romantic relationship.

Let’s create a box plot to look at this relationship and try different ways to define the whiskers.

# Set the whiskers to 0.5 * IQR
g = sns.catplot(x = "romantic", y = "G3",
                data = student_data, kind = "box", whis = 0.5)

# Show plot
plt.show()

# Extend the whiskers to the 5th and 95th percentile
g = sns.catplot(x = "romantic", y = "G3", data = student_data,
                kind = "box", whis = [5,95])

# Show plot
plt.show()

# Set the whiskers at the min and max values
g = sns.catplot(x = "romantic", y = "G3", data = student_data,
                kind = "box", whis = [0, 100])

# Show plot
plt.show()

The median grade is the same between these two groups, but the max grade is higher among students who are not in a romantic relationship.
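
To make the different whisker definitions concrete, here is a sketch that computes where the whiskers would end for each setting. The helper `whisker_ends` and the `grades` list are hypothetical. The logic follows our understanding of the usual Matplotlib convention, in which whiskers stop at the most extreme data point inside the limit.

```python
import numpy as np

def whisker_ends(values, whis=1.5):
    """Return (low, high) whisker positions for a box plot of `values`.

    A single number means "reach whis * IQR beyond the box";
    a pair means "reach to these two percentiles".
    """
    x = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(x, [25, 75])
    if np.iterable(whis):
        lo_limit, hi_limit = np.percentile(x, whis)
    else:
        iqr = q3 - q1
        lo_limit, hi_limit = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the most extreme data points inside the limits
    low = x[x >= lo_limit].min()
    high = x[x <= hi_limit].max()
    return low, high

grades = [0, 6, 9, 10, 10, 11, 12, 15, 20]  # hypothetical grades
for setting in (1.5, 0.5, [5, 95], [0, 100]):
    print(setting, whisker_ends(grades, whis=setting))
```

Points that fall outside the whisker ends are the ones drawn as outliers, so a smaller `whis` value flags more points and `[0, 100]` flags none.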

4.9 Lecture: Point plots

4.10 Customizing point plots

Let’s continue to look at data from students in secondary school, this time using a point plot to answer the question: does the quality of the student’s family relationship influence the number of absences the student has in school? Here, we’ll use the “famrel” variable, which describes the quality of a student’s family relationship from 1 (very bad) to 5 (very good).

As a reminder, to create a point plot, use the catplot() function and specify the name of the categorical variable to put on the x-axis (x=____), the name of the quantitative variable to summarize on the y-axis (y=____), the pandas DataFrame to use (data=____), and the type of categorical plot (kind=“point”).

# Create a point plot of family relationship vs. absences
g = sns.catplot(data = student_data, kind = 'point',
                x = "famrel", y = "absences")

# Show plot
plt.show()

# Add caps to the confidence interval
g = sns.catplot(x = "famrel", y = "absences", data = student_data,
                kind = "point", capsize = 0.2)
        
# Show plot
plt.show()

# Remove the lines joining the points
g = sns.catplot(x = "famrel", y = "absences", data = student_data,
                kind = "point", capsize = 0.2, join = False)
            
# Show plot
plt.show()

While the average number of absences is slightly smaller among students with higher-quality family relationships, the large confidence intervals tell us that we can’t be sure there is an actual association here.

4.11 Point plots with subgroups

Let’s continue exploring the dataset of students in secondary school. This time, we’ll ask the question: is being in a romantic relationship associated with higher or lower school attendance? And does this association differ by which school the students attend? Let’s find out using a point plot.

# Create a point plot that uses color to create subgroups
g = sns.catplot(data = student_data, kind = 'point',
                x = "romantic", y = "absences", hue = "school")

# Show plot
plt.show()

# Turn off the confidence intervals
g = sns.catplot(x = "romantic", y = "absences", data = student_data,
                kind = "point", hue = "school", ci = None)

# Show plot
plt.show()

# Import median function from numpy
from numpy import median

# Plot the median number of absences instead of the mean
g = sns.catplot(x = "romantic", y = "absences", data = student_data,
                kind = "point", hue = "school", ci = None, estimator = median)

# Show plot
plt.show()

It looks like students in romantic relationships have a higher average and median number of absences in the GP school, but this association does not hold for the MS school.
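
Why the choice of estimator matters can be seen on a small made-up sample in which one student has many absences. The `toy` values below are hypothetical.

```python
import pandas as pd

# Hypothetical absences; one student with many absences skews the mean
toy = pd.DataFrame({
    "romantic": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "absences": [0, 2, 4, 30, 0, 2, 4, 6],
})

# Compare the two estimators side by side for each group
summary = toy.groupby("romantic")["absences"].agg(["mean", "median"])
print(summary)
```

The single extreme value pulls the mean of the "yes" group well above its median, while the two groups share the same median. This is the kind of difference that switching to `estimator = median` can reveal.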

5 Customizing Seaborn Plots

5.1 Lecture: Changing plot style and color

5.2 Changing style and palette

Let’s return to our dataset containing the results of a survey given to young people about their habits and preferences. We’ve provided the code to create a count plot of their responses to the question “How often do you listen to your parents’ advice?”. Now let’s change the style and palette to make this plot easier to interpret.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/survey_data_parent.csv'
survey_data = pd.read_csv(url)

# Set the style to "whitegrid"
sns.set_style("whitegrid")

# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes", 
                  "Often", "Always"]

g = sns.catplot(x = "Parents Advice", data = survey_data, 
                kind = "count", order = category_order)

# Show plot
plt.show()

# Keep the "whitegrid" style; the "Purples" palette is passed to catplot() below
sns.set_style("whitegrid")

# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes", 
                  "Often", "Always"]

g = sns.catplot(x = "Parents Advice", data = survey_data, 
                kind = "count", order = category_order, palette = "Purples")

# Show plot
plt.show()

# Change the color palette to "RdBu"
sns.set_style("whitegrid")
sns.set_palette("RdBu")

# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes", 
                  "Often", "Always"]

g = sns.catplot(x = "Parents Advice", data = survey_data, 
                kind = "count", order = category_order)

# Show plot
plt.show()

This style and diverging color palette best highlights the difference between the number of young people who usually listen to their parents’ advice versus those who don’t.

5.3 Changing the scale

In this exercise, we’ll continue to look at the dataset containing responses from a survey of young people. Does the percentage of people reporting that they feel lonely vary depending on how many siblings they have? Let’s find out using a bar plot, while also exploring Seaborn’s four different plot scales (“contexts”).

# Set the context to "paper"
sns.set_context("paper")

# Create bar plot
g = sns.catplot(x = "Siblings", y = "Loneliness",
                data = survey_data, kind = "bar")

# Show plot
plt.show()

# Change the context to "notebook"
sns.set_context("notebook")

# Create bar plot
g = sns.catplot(x = "Siblings", y = "Loneliness",
                data = survey_data, kind = "bar")

# Show plot
plt.show()

# Change the context to "talk"
sns.set_context("talk")

# Create bar plot
g = sns.catplot(x = "Siblings", y = "Loneliness",
                data = survey_data, kind = "bar")

# Show plot
plt.show()

# Change the context to "poster"
sns.set_context("poster")

# Create bar plot
g = sns.catplot(x = "Siblings", y = "Loneliness",
                data = survey_data, kind = "bar")

# Show plot
plt.show()

Each context name gives Seaborn’s suggestion on when to use a given plot scale (in a paper, in an IPython/Jupyter notebook, in a talk/presentation, or in a poster session).
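
One way to see what a context changes is to inspect its settings without drawing anything. The sketch below uses `sns.plotting_context()`, which returns the parameters for a named context. The two keys printed are just a sample of those parameters, and the "Agg" backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import seaborn as sns

context_names = ["paper", "notebook", "talk", "poster"]

# plotting_context() returns the settings for a context without applying them
font_sizes = []
for name in context_names:
    settings = sns.plotting_context(name)
    font_sizes.append(settings["font.size"])
    print(name, settings["font.size"], settings["lines.linewidth"])
```

The font size grows from "paper" to "poster", which matches the increasing scale of the four bar plots above.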

5.4 Using a custom palette

So far, we’ve looked at several things in the dataset of survey responses from young people, including their internet usage, how often they listen to their parents, and how many of them report feeling lonely. However, one thing we haven’t done is a basic summary of the type of people answering this survey, including their age and gender. Providing these basic summaries is always a good practice when dealing with an unfamiliar dataset.

The code provided will create a box plot showing the distribution of ages for male versus female respondents. Let’s adjust the code to customize the appearance, this time using a custom color palette.

# Set the style to "darkgrid"
sns.set_style("darkgrid")

# Set a custom color palette
sns.set_palette(["#39A7D0", "#36ADA4"])

# Create the box plot of age distribution by gender
g = sns.catplot(x = "Gender", y = "Age", 
                data = survey_data, kind = "box")

# Show plot
plt.show()

It looks like the median age is the same for males and females, but the age distribution of females skews younger than that of males.

5.5 Lecture: Adding titles and labels: Part 1

5.6 FacetGrids vs. AxesSubplots

In the recent lesson, we learned that Seaborn plot functions create two different types of objects: FacetGrid objects and AxesSubplot objects. The method for adding a title to your plot will differ depending on the type of object it is.

In the code provided, we’ve used relplot() with the miles per gallon dataset to create a scatter plot showing the relationship between a car’s weight and its horsepower. This scatter plot is assigned to the variable name g. Let’s identify which type of object it is.

# Create scatter plot
g = sns.relplot(x = "weight", y = "horsepower", 
                data = mpg, kind = "scatter")

# Identify plot type
type_of_g = type(g)

# Print type
print(type_of_g)
## <class 'seaborn.axisgrid.FacetGrid'>

Besides relplot(), catplot() also supports creating subplots, so it creates a FacetGrid object as well.

5.7 Adding a title to a FacetGrid object

In the previous exercise, we used relplot() with the miles per gallon dataset to create a scatter plot showing the relationship between a car’s weight and its horsepower. This created a FacetGrid object. Now that we know what type of object it is, let’s add a title to this plot.

# Create scatter plot
g = sns.relplot(x = "weight", y = "horsepower", 
                data = mpg, kind = "scatter")

# Add a title "Car Weight vs. Horsepower"
g.fig.suptitle("Car Weight vs. Horsepower")

# Show plot
plt.show()

It looks like a car’s weight is positively correlated with its horsepower.

5.8 Lecture: Adding titles and labels: Part 2

5.9 Adding a title and axis labels

Let’s continue to look at the miles per gallon dataset. This time we’ll create a line plot to answer the question: How does the average miles per gallon achieved by cars change over time for each of the three places of origin? To improve the readability of this plot, we’ll add a title and more informative axis labels.

In the code provided, we create the line plot using the lineplot() function. Note that lineplot() does not support the creation of subplots, so it returns an AxesSubplot object instead of a FacetGrid object.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Visualization%20with%20Seaborn/mpg_mean.csv'
mpg_mean = pd.read_csv(url, sep = ';')

# Create line plot
g = sns.lineplot(x = "model_year", y = "mpg_mean", 
                 data = mpg_mean, hue = "origin")

# Add a title "Average MPG Over Time"
g.set_title("Average MPG Over Time")

# Add x-axis and y-axis labels
g.set(xlabel = "Car Model Year", ylabel = "Average MPG")

# Show plot
plt.show()
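
Since the titling method depends on the object type, the two cases can be wrapped in one small function. The helper `add_title` below is hypothetical (it is not part of Seaborn), and the `toy` data is made up. It uses `g.fig.suptitle()` for a FacetGrid and `set_title()` for a single axes object, as in the exercises above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def add_title(plot_obj, title):
    """Hypothetical helper: add a title to a FacetGrid or a single axes object."""
    if isinstance(plot_obj, sns.FacetGrid):
        plot_obj.fig.suptitle(title)   # figure-level title
    else:
        plot_obj.set_title(title)      # axes-level title

# Made-up data borrowing two column names from the mpg dataset
toy = pd.DataFrame({"weight": [2000, 2500, 3000, 3500],
                    "horsepower": [70, 90, 120, 150]})

# relplot() returns a FacetGrid
g = sns.relplot(data=toy, x="weight", y="horsepower", kind="scatter")
add_title(g, "Car Weight vs. Horsepower")

# scatterplot() returns a single Matplotlib axes object
fig, ax = plt.subplots()
ax = sns.scatterplot(data=toy, x="weight", y="horsepower", ax=ax)
add_title(ax, "Car Weight vs. Horsepower")

print(type(g), type(ax))
```

Checking the type first avoids calling `set_title()` on a FacetGrid, which does not provide that method.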

5.10 Rotating x-tick labels

In this exercise, we’ll continue looking at the miles per gallon dataset. In the code provided, we create a point plot that displays the average acceleration for cars in each of the three places of origin. Note that the “acceleration” variable is the time to accelerate from 0 to 60 miles per hour, in seconds. Higher values indicate slower acceleration.

Let’s use this plot to practice rotating the x-tick labels. Recall that the function to rotate x-tick labels is a standalone Matplotlib function and not a function applied to the plot object itself.

# Create point plot
g = sns.catplot(x = "origin", y = "acceleration", data = mpg, 
                kind = "point", join = False, capsize = 0.1)

# Rotate x-tick labels
plt.xticks(rotation = 90)
## ([0, 1, 2], [Text(0, 0, 'usa'), Text(1, 0, 'japan'), Text(2, 0, 'europe')])
# Show plot
plt.show()

5.11 Lecture: Putting it all together

5.12 Box plot with subgroups

In this exercise, we’ll look at the dataset containing responses from a survey given to young people. One of the questions asked of the young people was: “Are you interested in having pets?” Let’s explore whether the distribution of ages of those answering “yes” tends to be higher or lower than those answering “no”, controlling for gender.

# Set palette to "Blues"
sns.set_palette("Blues")

# Adjust to add subgroups based on "Pets"
g = sns.catplot(x = "Gender", y = "Age", data = survey_data, 
                kind = "box", hue = "Pets")

# Set title to "Age of Those Interested in Pets vs. Not"
g.fig.suptitle("Age of Those Interested in Pets vs. Not")

# Show plot
plt.show()

5.13 Bar plot with subgroups and subplots

In this exercise, we’ll return to our young people survey dataset and investigate whether the proportion of people who like techno music (“Likes Techno”) varies by their gender (“Gender”) or where they live (“Village - town”). This exercise will give us an opportunity to practice the many things we’ve learned throughout this course!

# Set the figure style to "dark"
sns.set_style("dark")

# Adjust to add subplots per gender
g = sns.catplot(x = "Village - town", y = "Techno", data = survey_data,
                kind = "bar", col = "Gender")

# Add title and axis labels
g.fig.suptitle("Percentage of Young People Who Like Techno")
g.fig.subplots_adjust(top = 0.9)
f = g.set(xlabel = "Location of Residence",
          ylabel = "% Who Like Techno")

# Show plot
plt.show()

6 Course Recap

Congratulations on completing the course! More courses, tracks and instructions can be found here. Happy learning!